Is the Random Tree Puzzle process the same as the Yule-Harding process? 
It has been suggested that a Random Tree Puzzle (RTP) process leads to a Yule-Harding (YH) distribution, when the
number of taxa becomes large. In this study, we formalize this conjecture, and we prove that the two tree distributions 
converge for two particular properties, which suggests that the conjecture may be true. However, we present evidence 
that, while the two distributions are close, the RTP appears to converge on a different distribution than does the YH. 



1. Introduction 

The maximum likelihood (ML) approach (Felsenstein
1981; Guindon and Gascuel 2003; Guindon et al. 2010) 
is generally considered to be a reliable way of estimating 
phylogenies from DNA sequences. However, ML is not 
always feasible for large numbers of species, because of 
the intensive computation required. Methods that use 'four 
point subsets' (Dress et al. 1986) reduce the complexity of 
the problem, and have assisted numerous studies (Daubin
and Ochman 2004; Nieselt-Struwe and von Haeseler 2001; 
Strimmer et al. 1997; Strimmer and von Haeseler 1996). 

The four-point subtree is known as the quartet tree.
Quartet puzzling (QP) (Strimmer and von Haeseler 1996)
is an algorithm that infers a tree on n taxa from the quartet
trees derived from DNA sequences. It first computes
the likelihood of all quartets. As there are three possible
topologies for any four taxa, the quartet tree with the
greatest likelihood is used (any ties are broken
uniformly at random). At the puzzling step, the order
of inserting new leaf nodes is randomized. A seed tree is
built from the first four elements of the ordered leaf
sequence. From this point on, leaves are attached sequentially
by the following procedure: when a new leaf x is to
be attached to the existing tree T, a quartet tree is built
for x together with each three-element subset {i,j,k} of
the existing leaf set. If the ML quartet tree
of {i,j,k,x} is ij|kx, then weight 1 is added to every edge
on the path in T connecting the two leaves i and j. This
process is repeated for all such quartet trees, and x is then
attached to the edge which has the minimal weight. An
example is given in Figure 1.
FIG. 1. Suppose leaf F is about to be attached to the five-taxon tree
on the left, and the ML trees of {i,j,k,F} are AB|CF, AC|EF, BC|DF,
AC|DF, AB|DF, AD|EF, AB|EF, BC|EF, BD|EF, and CE|DF. The external
edge leading to E receives the minimal weight, so F is attached to
this edge, leading to the six-taxon tree shown on the right.
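The weights on the pendant edges in the Figure 1 example can be recomputed from the caption alone, because the pendant edge of a leaf lies on the path between i and j exactly when that leaf is i or j. The following Python sketch does this; the function name `pendant_edge_weights` is an illustrative choice, and the interior edges are not scored because the topology of the five-taxon tree is not recoverable from the caption.

```python
from collections import Counter

# ML quartet trees from the Fig. 1 caption, stored as (i, j, k) for ij|kF.
# Only the pair (i, j) matters: each edge on the i-j path gains weight 1.
QUARTETS = [
    ("A", "B", "C"), ("A", "C", "E"), ("B", "C", "D"), ("A", "C", "D"),
    ("A", "B", "D"), ("A", "D", "E"), ("A", "B", "E"), ("B", "C", "E"),
    ("B", "D", "E"), ("C", "E", "D"),
]


def pendant_edge_weights(quartets, leaves):
    """Weight accumulated by the pendant edge of each existing leaf.

    The pendant edge of a leaf is on the i-j path exactly when the leaf
    is i or j, so the tree topology is not needed for these edges.
    """
    weights = Counter({leaf: 0 for leaf in leaves})
    for i, j, _k in quartets:
        weights[i] += 1
        weights[j] += 1
    return dict(weights)


weights = pendant_edge_weights(QUARTETS, "ABCDE")
lightest = min(weights, key=weights.get)
print(weights, lightest)
```

Among the pendant edges, the one leading to E carries the least weight, in agreement with the caption.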



Since the order of adding leaves is randomized, this 
can lead to variation in the resulting tree topologies, and 
so a consensus tree of numerous replicates is used as the 
output tree. The program Tree-puzzle (TP) (Schmidt et al. 
2002) is a parallel version of QP, which performs independent
puzzling steps simultaneously.

The trees generated by either the QP or TP process 
depend on the biological sequences we have for the taxa. 
To investigate how the TP process behaves on randomized 
quartets, Vinh et al. (2011) performed a simulation study 
on a so-called random tree puzzle (RTP) process. This as- 
sumes that no prior molecular information is given. There- 
fore, for the same quartet set, all three tree topologies are 
equally likely. The authors compare the empirical proba- 
bilities of tree topologies against the theoretical probabili- 
ties from the proportional to distinguishable arrangement 
(PDA) model and the Yule-Harding (YH) model. Table 1 
from Vinh et al. (2011) reveals that the RTP's empirical 
probabilities are very close to the YH theoretical probabil- 
ities (indeed, there are two cases where these probabilities 
are identical). As it seems that the differences between the 
empirical and theoretical probabilities decrease as the num- 
ber of taxa increases, Vinh et al. (2011) suggest that the 
RTP process converges to the YH process as n (the number
of taxa) grows. The authors provided further evidence
for their conjecture by comparing some properties of RTP
trees with YH trees. Recall that a cherry in a tree is a pair
of leaves that are adjacent to the same vertex. Vinh
et al. (2011) found that the mean and variance of the number
of cherries in the RTP simulation were similar to
the theoretical values under the YH process (McKenzie and
Steel 2000).
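The RTP process described above can be sketched in a few lines of Python. This is a minimal illustration rather than the Tree-puzzle implementation: the names `rtp_tree`, `path_edges` and `count_cherries` are hypothetical, the seed quartet is fixed as 01|23, leaves are inserted in index order, and ties for the minimal weight are assumed to be broken uniformly at random.

```python
import itertools
import random


def path_edges(adj, a, b):
    """Return the edges (as frozensets) on the unique a-b path of a tree."""
    parent = {a: None}
    stack = [a]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                stack.append(w)
    edges = []
    while b != a:
        edges.append(frozenset((b, parent[b])))
        b = parent[b]
    return edges


def rtp_tree(n, rng):
    """Grow an n-leaf tree (n >= 4) with uniformly random quartet trees.

    Leaves are 0..n-1; interior vertices are negative integers.
    """
    adj = {0: {-1}, 1: {-1}, 2: {-2}, 3: {-2},
           -1: {0, 1, -2}, -2: {2, 3, -1}}
    next_interior = -3
    for x in range(4, n):
        weights = {frozenset((u, w)): 0 for u in adj for w in adj[u]}
        for i, j, k in itertools.combinations(range(x), 3):
            # A uniform quartet topology on {i, j, k, x} amounts to choosing
            # which pair of existing leaves is grouped together.
            a, b = rng.choice([(i, j), (i, k), (j, k)])
            for edge in path_edges(adj, a, b):
                weights[edge] += 1
        lightest = min(weights.values())
        # Assumption of this sketch: ties are broken uniformly at random.
        candidates = [tuple(e) for e, wt in weights.items() if wt == lightest]
        u, w = rng.choice(candidates)
        m = next_interior
        next_interior -= 1
        # Subdivide edge {u, w} with a new interior vertex m and hang x on it.
        adj[u].remove(w)
        adj[w].remove(u)
        adj[u].add(m)
        adj[w].add(m)
        adj[m] = {u, w, x}
        adj[x] = {m}
    return adj


def count_cherries(adj):
    """Number of interior vertices adjacent to exactly two leaves."""
    return sum(
        1 for nbrs in adj.values()
        if len(nbrs) == 3 and sum(len(adj[w]) == 1 for w in nbrs) == 2
    )


tree = rtp_tree(9, random.Random(0))
print(count_cherries(tree))
```

Each insertion examines every triple of existing leaves, so the sketch is only practical for small n; it is intended to make the process concrete, not to reproduce the simulations of Vinh et al. (2011).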

Although Vinh et al. (2011) provided evidence to suggest
that the two distributions appear to become very similar as
n grows, they did not provide a formal statement or proof 
of their claim that the two distributions converge. In this 
project, we investigate the RTP process further using math- 
ematical and statistical methods. Our results demonstrate 
that certain properties of the trees that are near the 'periph- 
ery' of the tree (i.e. near the leaves) converge under the 
two distributions; however the 'deep' structure of the trees 
(how the tree is broken up around its centroid) appears to 
retain a trace that distinguishes the two models as the trees 
become large. 

2. Formalized Conjecture 

Given two discrete probability distributions p and q
on Y, the total variational distance between p and q is
defined as:

$$d_{VAR}(p,q) = \max_{A \subseteq Y} \left| \mathbb{P}_p(A) - \mathbb{P}_q(A) \right|,$$

where $\mathbb{P}_p(A) = \sum_{y \in A} p(y)$ and $\mathbb{P}_q(A) = \sum_{y \in A} q(y)$ are the
probabilities of event A under the distributions p and q
respectively. Thus $d_{VAR}(p,q)$ is the largest possible probability
difference of any event under the distributions p and q.

A well-known and elementary result is that

$$d_{VAR}(p,q) = \frac{1}{2} \sum_{y \in Y} |p(y) - q(y)|,$$

and thus the two distributions are the same if and only if
$d_{VAR} = 0$.
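The two expressions for the total variational distance can be compared numerically on a small example. The sketch below is illustrative only: the distributions `p` and `q` are made-up values on three hypothetical outcomes, and the function names are not taken from the original study.

```python
from itertools import chain, combinations


def tv_half_l1(p, q):
    """Half of the L1 distance between two distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in support)


def tv_max_event(p, q):
    """Largest probability difference over all events (brute force)."""
    support = sorted(set(p) | set(q))
    events = chain.from_iterable(
        combinations(support, r) for r in range(len(support) + 1)
    )
    return max(
        abs(sum(p.get(y, 0.0) for y in event) - sum(q.get(y, 0.0) for y in event))
        for event in events
    )


# Hypothetical distributions on three outcomes, for illustration only.
p = {"t1": 0.5, "t2": 0.3, "t3": 0.2}
q = {"t1": 0.4, "t2": 0.4, "t3": 0.2}
print(tv_half_l1(p, q), tv_max_event(p, q))
```

The brute-force maximum enumerates every subset of the support, so it is only feasible for very small sample spaces; the half-L1 form is the practical one.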

A tree with the leaf set $X_n = \{1,2,\ldots,n\}$ is called an
$X_n$-tree. In the rest of this article, all $X_n$-trees referred to are
binary trees, in which every interior node has degree three.
We use $T_n$ to denote a labeled $X_n$-tree topology, and $t_n$ to
denote an unlabeled $X_n$-tree shape. Vinh et al. (2011) suggest
that when the number of taxa (n) becomes large, RTP converges
to the YH distribution. In this study, we consider
the total variational distance between the distributions on
tree topologies under the RTP and the YH processes, and
formalize the conjecture from Vinh et al. (2011). This
formalization states that the variational distance between the
two tree distributions converges to zero as the number of
taxa grows. We first note that it makes no difference
to the truth of this conjecture whether the trees are labeled
or unlabeled.

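Lemma 1. Let $\mathcal{T}(n)$ and $\mathcal{S}(n)$ be the sets of labeled and
unlabeled $X_n$-trees respectively. For $T_n \in \mathcal{T}(n)$ and $t_n \in \mathcal{S}(n)$, let

$$\Delta_n := \sum_{T_n \in \mathcal{T}(n)} |\mathbb{P}_{YH}(T_n) - \mathbb{P}_{RTP}(T_n)| \quad \text{and} \quad \delta_n := \sum_{t_n \in \mathcal{S}(n)} |\mathbb{P}_{YH}(t_n) - \mathbb{P}_{RTP}(t_n)|.$$

Then $\Delta_n = \delta_n$, and in particular $\lim_{n \to \infty} \Delta_n = 0$ if and only if
$\lim_{n \to \infty} \delta_n = 0$.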

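Proof. Let $\nu(t_n)$ be the number of $X_n$-trees $T_n$ that have the
shape $t_n$. Then, for $* \in \{YH, RTP\}$, $\mathbb{P}_*(T_n) = \mathbb{P}_*(t_n)/\nu(t_n)$
whenever $T_n$ has shape $t_n$, and we have:

$$\begin{aligned}
\Delta_n &= \sum_{T_n \in \mathcal{T}(n)} |\mathbb{P}_{YH}(T_n) - \mathbb{P}_{RTP}(T_n)| \\
&= \sum_{t_n \in \mathcal{S}(n)} \; \sum_{\substack{T_n \in \mathcal{T}(n) \\ T_n \text{ has shape } t_n}} |\mathbb{P}_{YH}(T_n) - \mathbb{P}_{RTP}(T_n)| \\
&= \sum_{t_n \in \mathcal{S}(n)} \nu(t_n) \left| \frac{\mathbb{P}_{YH}(t_n)}{\nu(t_n)} - \frac{\mathbb{P}_{RTP}(t_n)}{\nu(t_n)} \right| \\
&= \sum_{t_n \in \mathcal{S}(n)} |\mathbb{P}_{YH}(t_n) - \mathbb{P}_{RTP}(t_n)| \\
&= \delta_n.
\end{aligned}$$

□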

Thus, we formalize the conjecture from Vinh et al. 
(2011) as follows: 

Conjecture (strong version) 

With $\Delta_n = \delta_n$ defined as above, $\lim_{n \to \infty} \Delta_n = 0$.

Note that, in the YH process, new leaves are only ever 
attached to pendant edges, and each pendant edge is se- 
lected with equal probability. We say that such leaves are 
attached to uniformly selected pendant edges. By contrast, 
the RTP process can attach new leaves to any edge, although
RTP has an increasingly strong preference to attach
leaves to pendant edges as the tree grows (Vinh et al. 2011).
These authors also suggested that as the tree grows, the
number of cherries of an RTP tree follows the same limiting
distribution as the number of cherries of a YH tree, which
is normally distributed. We summarize these two claims as
follows:

Conjecture (weak version) 

1. Let $\mathcal{E}_m$ be the event that all leaf attachments under
the RTP beyond the first m leaves are to uniformly
selected pendant edges. Then $\mathbb{P}(\mathcal{E}_m) \to 1$ as m tends
to infinity.

2. The distribution of the number of cherries in an RTP tree
converges to the same (asymptotic) normal distribution as
under the YH model.

In our paper, we prove the two parts of the weak con- 
jecture, and present statistical evidence that the strong con- 
jecture is not true. 

3. RTP is similar to YH when n is large 

To verify Part 1 of the weak conjecture, we need to establish
that the probability that a new leaf attaches to a pendant
edge converges to 1 sufficiently quickly as the number
of leaves increases. This requires that the pendant edges
carry less weight than the interior edges. In addition, when
the new leaf is added, all pendant edges must be equally
likely to be chosen. Thus we must check the edge weight
distribution during the puzzling step of the RTP process.

3.1 Distribution of edge weights 

Let $E_P$ denote the set of pendant edges of the current
$X_n$-tree $T_n$ and let $E_I$ be the set of interior edges. For any
edge e of $T_n$, we let W(e) denote the random edge weight
accumulated during the quartet puzzling step. Suppose edge e
has k leaves of $T_n$ on one side and n - k leaves of $T_n$ on the
other side. The following result is established in the Appendix.

Lemma 2. W(e) is a binomial random variable with
$\frac{k(n-k)(n-2)}{2}$ as the number of trials and $\frac{2}{3}$ as the
probability of success on each trial.
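The two parameters in Lemma 2 can be checked by brute-force enumeration under the puzzling rule stated in the Introduction (the quartet tree ij|kx adds weight along the path between i and j). In the sketch below, the function names and the labelling of leaves as 0..n-1, with the first k on one side of e, are illustrative assumptions.

```python
from itertools import combinations


def splitting_triples(n, k):
    """Triples of existing leaves with members on both sides of edge e.

    Leaves 0..k-1 are taken to lie on one side of e and k..n-1 on the other.
    """
    side_a = set(range(k))
    return [
        t for t in combinations(range(n), 3)
        if 0 < len(side_a.intersection(t)) < 3
    ]


def topologies_crossing(triple, side_a):
    """How many of the three quartet topologies on triple + x weight e."""
    a, b, c = triple
    # Topology ab|cx adds weight along the a-b path, ac|bx along a-c, etc.
    pairs = [(a, b), (a, c), (b, c)]
    return sum((u in side_a) != (v in side_a) for u, v in pairs)


n, k = 9, 3
triples = splitting_triples(n, k)
print(len(triples), k * (n - k) * (n - 2) // 2)
```

Only triples that straddle e can contribute, and for each of them exactly two of the three equally likely topologies route the weighted path through e.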

The parameter k takes the value 1 or n - 1 for a pendant
edge; for an interior edge, k lies between 2 and n - 2.
Next, we show that for any fixed pendant and interior edge,
the probability that the interior edge has lower weight
converges to zero exponentially fast with increasing n. More
precisely, for any $e'' \in E_P$ and any $e' \in E_I$, we establish the
following result in the Appendix:

$$\mathbb{P}(W_n(e'') > W_n(e')) \le 2\exp(-\kappa n), \tag{1}$$

where $\kappa > 0$ is a constant that does not depend on n.

This result is for a fixed pair of pendant and interior edges, 
but it easily implies that the probability that the smallest 
weight in the tree is on a pendant rather than an interior 
edge converges quickly to 1 with increasing n. This is for- 
malized in the following inequality, also proved in the Ap- 
pendix: 




(2) 

Thus a new leaf is almost certain to be added to pendant 
edges; moreover, as noted above, each pendant edge has 
equal probability of being attached to. 



3.2 New leaves attach rarely to interior edges 

Theorem 1. Suppose $T_m \in \mathcal{T}(m)$, and let $\mathcal{E}_m$ be the event that
all leaf attachments under RTP beyond $T_m$ are to uniformly
selected pendant edges. Then, for constants a, b > 0:

$$\mathbb{P}(\mathcal{E}_m) \ge 1 - a e^{-bm}.$$

Proof. Let $B_k$ be the event that the (k + 1)-st leaf is not
attached to any leaf edge of $T_k$. Then we have
$1 - \mathbb{P}(\mathcal{E}_m) = \mathbb{P}(\bigcup_{k=m}^{\infty} B_k)$. By Boole's inequality, we have
$\mathbb{P}(\bigcup_{k=m}^{\infty} B_k) \le \sum_{k=m}^{\infty} \mathbb{P}(B_k)$. By Inequality (2),
$\mathbb{P}(B_k) \le 2k^2 \exp(-\kappa k)$ for a constant $\kappa > 0$. We now use the
following general inequality, the proof of which is given in the
Appendix. If $Q_m = \sum_{k=m}^{\infty} k^2 \exp(-ck)$, where $c > 0$, then for all
sufficiently large m (say $m \ge m_0$):

$$Q_m \le \frac{\exp(-cm/2)}{1 - \exp(-c/2)}. \tag{3}$$

Thus, for $m \ge m_0$,

$$1 - \mathbb{P}(\mathcal{E}_m) \le \sum_{k=m}^{\infty} 2k^2 \exp(-\kappa k) \le \frac{2}{1 - \exp(-\kappa/2)} \exp\left(-\frac{\kappa}{2} m\right).$$

Rearranging this inequality establishes the inequality in the
theorem. The uniformity follows by Lemma 2. □



3.3 The mean and variance of the number of cherries in 
the RTP tree 

Table 3 of Vinh et al. (2011) reveals that the mean and
variance of the number of cherries on trees generated under
the RTP process and under the YH process are similar. In order
to provide a formal proof that they converge to the same
limiting distribution, we need to introduce the extended
Pólya urn (EPU) model.

3.3.1 Extended Pólya urn model Consider the following
extended Pólya urn (EPU) model: at time t = 0, there
are b blue balls and r red balls in an urn, where b ≥ 0 and
r ≥ 0. At each discrete time step, one ball is picked at random
from the urn. If the ball is blue, c additional blue balls
and d red balls will be placed; if the picked ball is red, e
additional blue balls and f red balls will be placed. The
values c, d, e, f can also be negative, in which case,
instead of placing new balls in the urn, the corresponding
number of balls of the appropriate colour is withdrawn. We
use $b_n$ to denote the number of blue balls after the nth draw,
and $S_n$ for the total number of balls. The following matrix
describes this process:

$$A = \begin{bmatrix} c & d \\ e & f \end{bmatrix}.$$

We require that A has positive and equal row sums, as well
as one real positive principal eigenvalue $\lambda$. Let $(v_1, v_2)$ be
the normalized eigenvector associated with $\lambda$. Then, under
these conditions, a classic result states that, as $n \to \infty$,

$$\frac{b_n - \lambda v_1 n}{\sqrt{n}} \xrightarrow{D} N(0, \sigma^2)$$

(Mahmoud 2008; Bagchi and Pal 1985), where $\xrightarrow{D}$ denotes
convergence in distribution. Crucially, the initial values of b
and r do not play any significant role in this limiting normal
distribution (or in its mean and variance).

3.3.2 EPU and attaching new edges only to pendant
edges We relate the Yule process to the EPU model as
follows: consider the set of cherry edges as a collection of
blue balls, and the non-cherry pendant edges as a collection
of red balls. When a new edge is attached to a pendant edge, if
it is attached to a cherry edge, the number of cherry edges
remains the same, but the number of non-cherry edges
increases by one. If a new edge is added to a non-cherry
edge, then the non-cherry edge becomes a cherry edge, and
the new edge is also a cherry edge. Thus, the generating
matrix is:

$$A = \begin{bmatrix} 0 & 1 \\ 2 & -1 \end{bmatrix}.$$

Notice that A has row sums equal to 1 and one real
positive eigenvalue $\lambda$, as required.

Let $C_n$ be the number of cherries in a YH tree. Then
as n tends to infinity,

$$Z_n := \frac{C_n - n/3}{\sqrt{2n/45}}$$

converges in distribution to a standard normal distribution
(i.e. $Z_n \xrightarrow{D} N(0,1)$), by Corollary 3 of McKenzie and Steel
(2000). We now show that the same holds for the distribution
of cherries in an RTP tree.

Theorem 2. Let $C_n^*$ be the number of cherries in an
RTP tree, and let $Z_n^* := (C_n^* - n/3)/\sqrt{2n/45}$. Then
$Z_n^* \xrightarrow{D} N(0,1)$.



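Proof. We need to show that for any $\varepsilon > 0$, and for all
sufficiently large values of n and all positive real x,

$$|\mathbb{P}(Z_n^* \le x) - \mathbb{P}(Z \le x)| < \varepsilon, \tag{4}$$

where Z is a standard normal random variable.

As before, let $\mathcal{E}_m$ be the event that after m leaves have
been attached to the starting tree by RTP, all further additions
are to pendant edges, and let $\mathcal{E}_m^c$ be the complement
of $\mathcal{E}_m$. For n > m, by the law of total probability, we have:

$$\mathbb{P}(Z_n^* \le x) = \mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)\,\mathbb{P}(\mathcal{E}_m) + \mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m^c)\,\mathbb{P}(\mathcal{E}_m^c). \tag{5}$$

If we now subtract $\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)$ from both sides of
Equation (5), we obtain:

$$\mathbb{P}(Z_n^* \le x) - \mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m) = \mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)\,(\mathbb{P}(\mathcal{E}_m) - 1) + \mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m^c)\,\mathbb{P}(\mathcal{E}_m^c). \tag{6}$$

By the triangle inequality ($|a + b| \le |a| + |b|$) we have:

$$\begin{aligned}
&|\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)\,(\mathbb{P}(\mathcal{E}_m) - 1) + \mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m^c)\,\mathbb{P}(\mathcal{E}_m^c)| \\
&\quad \le |\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)\,(\mathbb{P}(\mathcal{E}_m) - 1)| + |\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m^c)\,\mathbb{P}(\mathcal{E}_m^c)|.
\end{aligned} \tag{7}$$

Combining Equation (6) and Inequality (7) gives the following:

$$\begin{aligned}
|\mathbb{P}(Z_n^* \le x) - \mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)|
&\le |\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)\,(\mathbb{P}(\mathcal{E}_m) - 1)| + |\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m^c)\,\mathbb{P}(\mathcal{E}_m^c)| \\
&\le |\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)|\,|\mathbb{P}(\mathcal{E}_m) - 1| + |\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m^c)|\,|\mathbb{P}(\mathcal{E}_m^c)|.
\end{aligned} \tag{8}$$

Theorem 1 tells us that $\mathbb{P}(\mathcal{E}_m) \ge 1 - a e^{-bm}$, which
tends to 1 as m grows. Now, since $\mathbb{P}(\mathcal{E}_m^c) \to 0$ as m tends
to infinity, we can select a sufficiently large value of m that
$\mathbb{P}(\mathcal{E}_m^c) < \varepsilon/4$ and $\mathbb{P}(\mathcal{E}_m) > 1 - \varepsilon/4$. Thus, $\mathbb{P}(\mathcal{E}_m) - 1 \ge -\varepsilon/4$,
and $|\mathbb{P}(\mathcal{E}_m) - 1| < \varepsilon/4$. Since $0 \le \mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)$,
$\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m^c) \le 1$, Inequality (8) gives:

$$|\mathbb{P}(Z_n^* \le x) - \mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m)| < \varepsilon/4 + \varepsilon/4 = \varepsilon/2, \tag{9}$$

for all sufficiently large m, and all $n \ge m$ and x > 0.

Now we consider the sequence of $Z_n^*$ conditional on
$\mathcal{E}_m$. By conditioning on this event, all the new leaves are
attached to uniformly selected pendant edges. Because the EPU
argument that established the convergence of the sequence $Z_n$
(the normalization of the number of cherries in a YH tree)
does not depend on the initial number of cherries, for any
$\varepsilon > 0$ and every m there exists an integer $n_0$ so that for all
$n \ge n_0$ and x > 0:

$$|\mathbb{P}(Z_n^* \le x \mid \mathcal{E}_m) - \mathbb{P}(Z_n \le x)| \le \varepsilon/2. \tag{10}$$

Then, by the triangle inequality ($|a + b| \le |a| + |b|$), if we
add Inequalities (9) and (10), we have

$$|\mathbb{P}(Z_n^* \le x) - \mathbb{P}(Z_n \le x)| \le \varepsilon,$$

and since $Z_n$ converges in distribution to a standard normal,
this establishes (4).

□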

Theorem 2 shows that the number of cherries on the
RTP trees has a limiting normal distribution with the same 
asymptotic mean and variance as for the YH distribution. 
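The asymptotic mean n/3 and variance 2n/45 can be illustrated by simulating the pendant-edge attachment rule directly, tracking only the number of cherries. The following Python sketch does so; the function name, the seed, the tree size and the number of replicates are arbitrary choices made for illustration, not values from the original study.

```python
import random
import statistics


def yh_cherry_count(n, rng):
    """Cherries after growing a tree from 4 to n leaves, attaching each
    new leaf to a uniformly chosen pendant edge."""
    leaves, cherries = 4, 2  # the unrooted four-leaf tree has two cherries
    while leaves < n:
        # 2 * cherries of the pendant edges belong to a cherry. Hitting one
        # keeps the count; hitting any other pendant edge adds a cherry.
        if rng.random() >= 2 * cherries / leaves:
            cherries += 1
        leaves += 1
    return cherries


rng = random.Random(2024)
n = 300
counts = [yh_cherry_count(n, rng) for _ in range(2000)]
mean = statistics.fmean(counts)
variance = statistics.variance(counts)
print(mean, n / 3)
print(variance, 2 * n / 45)
```

The simulation covers only the attachment rule that the RTP follows once the event in Theorem 1 has occurred; it does not simulate the early RTP steps in which interior edges can still be chosen.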

We have also shown that, with probability tending to 1 as
m grows, every leaf added after the first m is attached to a
pendant edge, which verifies the weak conjecture. While these
two results may be regarded as providing some weak evidence
in favour of the strong conjecture, they do not constitute any
formal justification of it. In the next section, we provide an
analysis that suggests that the variational distance between
the two distributions remains bounded away from zero as
n grows, which would make the two processes distinct in the
limit.

4. Is RTP the same as YH? 

Consider the following scenario where we perform
the YH process on some starting tree with more than three
leaves, where v is one of the interior nodes. At node v, the
graph is divided into three subtrees (see Fig. 2). We let $L_i$
(i = 1,2,3) denote the leaf sets of these subtrees, and let
$l_i = |L_i|$ (i = 1,2,3) denote the number of leaves in the
sets. We normalize the $l_i$ values by the total number of
leaves n. Clearly, the sequence of $l_i/n$ values changes as
new leaves are gradually added to the whole tree.

4.1 Pólya urns and the centroid of a tree

Adding new leaves to the tree under the YH process
ensures that each new leaf is always added into one
of the leaf sets $L_i$ (i = 1,2,3). The probability that $l_i$
increases by one is the proportion of the leaves of the full
tree that lie in the ith subtree, namely $l_i/n$.
This is similar to the Pólya urn problem (Karr
1993) involving balls of three different colours.



Suppose that one ball is picked randomly at each step,
and replaced along with another ball of the same colour
into the urn. Let $F_n^i$ be the relative frequency of the ith
colour ball when n balls are present, and $F_n = (F_n^1, F_n^2, F_n^3)$.
Then $F_n$ converges (as $n \to \infty$) to a Dirichlet distribution
(Kotz et al. 2000) with the parameter vector $n_0 F_{n_0}$, where $n_0$
is the total initial number of balls. Different initial values
in the urn produce different distributions when n balls are
present in the urn, and this difference in distributions does
not converge to zero as n grows. This result suggests that
the YH process on different initial X-trees may well lead
to different distributions of the resulting trees. However, if
the final tree shape is the only information we are given,
then it will be impossible to identify the position of the
original vertex v in the final tree with certainty. Thus the
frequencies $F_n$ cannot be clearly measured from the final
tree alone. However, we can partly ameliorate this problem
by considering a particular vertex that we can easily
identify in the final tree, namely its centroid (Jordan 1869;
Mitchell 1978).
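The dependence of the urn's limiting behaviour on its initial composition can be seen in a short simulation. The sketch below is an illustration under stated assumptions: the two starting compositions (2, 2, 3) and (1, 3, 3), the threshold 0.19, and the numbers of steps and replicates are chosen here for convenience, and the urn tracks the three subtrees around a fixed vertex rather than the centroid of the final tree.

```python
import random


def polya_fractions(initial, steps, rng):
    """Final colour fractions of a Polya urn: draw a ball and return it
    together with one extra ball of the same colour."""
    counts = list(initial)
    for _ in range(steps):
        colour = rng.choices(range(len(counts)), weights=counts)[0]
        counts[colour] += 1
    total = sum(counts)
    return [c / total for c in counts]


def prob_first_fraction_at_least(initial, threshold, steps, reps, rng):
    """Monte Carlo estimate of P(fraction of the first colour >= threshold)."""
    hits = sum(
        polya_fractions(initial, steps, rng)[0] >= threshold
        for _ in range(reps)
    )
    return hits / reps


rng = random.Random(7)
p_from_223 = prob_first_fraction_at_least((2, 2, 3), 0.19, 300, 400, rng)
p_from_133 = prob_first_fraction_at_least((1, 3, 3), 0.19, 300, 400, rng)
print(p_from_223, p_from_133)
```

The two estimates remain clearly separated however long the urn runs, which is the phenomenon the centroid-based statistic is designed to detect in the tree setting.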
FIG. 2. Centroid of a tree



Definition. A vertex v of a tree T = (V,E) is a centroid
if each component of the disconnected graph T\v has at
most (1/2)|V| vertices.

A well known property of centroids states that a tree 
has either a single centroid or two adjacent centroids, in 
which case |V| is even (Kang and Ault 1975). To keep the 
problem simple, we only consider trees with a single cen- 
troid. However, because T is a binary tree, |V| is always 
even, and so this does not guarantee a unique centroid.
Fortunately, the following lemma shows that a binary tree with
an odd number of leaves always has a unique centroid.

Lemma 3. Let T be an unrooted binary $X_n$-tree. Then:

1. A vertex v of T is a centroid of T if and only if v satisfies
$l_1, l_2, l_3 \le \frac{n}{2}$, where the $l_i$ are the numbers of leaves of
the three subtrees of T\v.

2. If n is odd, then T has a unique centroid.

Proof. (1) Suppose that v is an interior vertex of T. Consider
the vertex sets $V_1$, $V_2$ and $V_3$ of the connected
components of T\v. Let $l_i$ be the number of leaves in
$V_i$. Considering the rooted binary tree on $V_i$, we have:

$$|V_i| = 2l_i - 1. \tag{11}$$

Also, since T is an unrooted binary tree, we have:

$$|V| = 2n - 2. \tag{12}$$

Thus,

$$|V_i| \le \tfrac{1}{2}|V| \quad \text{if and only if} \quad 2l_i - 1 \le \tfrac{1}{2}(2n - 2),$$

and this holds precisely if $l_i \le n/2$. Thus, the condition
for v to be a centroid (namely that $|V_i| \le \frac{1}{2}|V|$ for
i = 1,2,3) is precisely the same as that stated in the
lemma.

(2) Suppose v is a centroid of T. At v, we let $L_i$ (i =
1,2,3) denote the leaf sets of the subtrees $T_i$ and let
$l_i$ denote the sizes of these leaf sets, ordered so that
$l_j \le l_3 \le \frac{n}{2}$ (j = 1,2). Since n is odd, we have $l_3 < \frac{n}{2}$.

Suppose another centroid $v'$ exists. We use $L_i'$ to denote
the complement of $L_i$. Then there is a subtree H of
T rooted at $v'$, with leaf set $L_H$, where $L_H \supseteq G'$ and
$G' \in \{L_1', L_2', L_3'\}$. Since $l_j \le l_3 < \frac{n}{2}$, where $j \in \{1,2\}$,
we then have $|L_H| \ge |G'| > \frac{n}{2}$. Therefore, $v'$ cannot be
a centroid.

□
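Part 1 of Lemma 3 gives a direct test for centroids that can be run on small trees. The following Python sketch applies it to the two seven-leaf shapes; the vertex labels and edge lists are an assumed encoding of the caterpillar and non-caterpillar trees, and the function names are illustrative.

```python
def build_adjacency(edges):
    """Adjacency sets of an undirected tree given as a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def leaves_beyond(adj, v, neighbour):
    """Number of leaves in the component of T - v containing `neighbour`."""
    count, stack = 0, [(neighbour, v)]
    while stack:
        node, parent = stack.pop()
        children = [w for w in adj[node] if w != parent]
        if not children:
            count += 1  # a vertex with no onward neighbours is a leaf
        stack.extend((w, node) for w in children)
    return count


def centroids(adj):
    """Interior vertices whose three subtrees each hold at most n/2 leaves,
    mapped to their sorted subtree leaf counts (l1, l2, l3)."""
    n = sum(1 for v in adj if len(adj[v]) == 1)
    found = {}
    for v in adj:
        if len(adj[v]) == 3:
            sizes = sorted(leaves_beyond(adj, v, w) for w in adj[v])
            if 2 * sizes[-1] <= n:
                found[v] = tuple(sizes)
    return found


# Assumed encodings of the two seven-leaf shapes (leaves a..g).
CATERPILLAR = build_adjacency([
    ("v1", "a"), ("v1", "b"), ("v1", "v2"), ("v2", "c"), ("v2", "v3"),
    ("v3", "d"), ("v3", "v4"), ("v4", "e"), ("v4", "v5"),
    ("v5", "f"), ("v5", "g"),
])
NON_CATERPILLAR = build_adjacency([
    ("u", "p"), ("u", "q"), ("u", "r"), ("p", "a"), ("p", "b"),
    ("q", "c"), ("q", "d"), ("r", "e"), ("r", "s"), ("s", "f"), ("s", "g"),
])

print(centroids(CATERPILLAR), centroids(NON_CATERPILLAR))
```

Both trees have seven leaves, so Part 2 of the lemma predicts a single centroid in each case, and the sorted subtree sizes are the quantities that are normalized by n in the discussion of the Pólya urn.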




FIG. 3. The two tree shapes for binary trees on seven leaves.
Panels: Non-Caterpillar (NC), Caterpillar (C).

We now relate the centroid back to the Pólya urn problem.
First notice that tree shapes only start to differentiate
when there are more than five leaves. Therefore, in the
following scenario, we perform the YH process from initial
trees with seven leaves. Suppose that a tree X is either
the non-caterpillar (NC) or caterpillar (C) tree shown in
Fig. 3. We will use X as the initial tree to construct some
tree $t_n$. At the centroid of $t_n$ when n = 7, the sequences of
$l_i/n$ are (2/7, 2/7, 3/7) and (1/7, 3/7, 3/7) for $t_7$ = NC and
$t_7$ = C respectively. Now, let us only consider the number
of leaves $l_1$ in the smallest subtree of $t_n$ for all odd values
of $n \ge 7$ (henceforth all values of n in this section are odd
to guarantee a unique centroid, and limits as n tends to
infinity are also over just the odd values of n). We define
the ratio of $l_1$ and the number of leaves n as $\pi_n^X = l_1/n$. For
$\gamma \in (0,1)$, let $\Pi^X$ be the limiting probability of the event
$\pi_n^X \ge \gamma$. In other words, $\Pi^X = \lim_{n \to \infty} \mathbb{P}(\pi_n^X \ge \gamma)$. To test the
null hypothesis that $\Pi^{NC} = \Pi^{C}$, we investigate the ratio
$\pi_n^X$ under the YH process. An additional 2000 leaves are
attached to the starting trees NC and C under the YH process,
with 1000 replicates in each case. We found that the probability
that $\pi_n^X$ is at least $\gamma = 0.19$ does not appear to converge to
a common value for the two choices of X (NC or C) (see Fig. 4).
Fig. 4 indicates the 95% confidence interval of proportions of
the event for which $\pi_n^X \ge 0.19$, which suggests the following
strict inequality:

$$\Pi^{NC} > \Pi^{C}. \tag{13}$$
FIG. 4. Empirical probabilities and 95% confidence intervals for the
proportion of the event that $\pi_n^X \ge 0.19$. The dashed line is for the
non-caterpillar seven-taxon initial tree, and the solid line is for the
caterpillar seven-taxon initial tree.



4.2 A modified RTP process 

To provide evidence that the RTP and the YH processes
are not exactly the same, we define a new process
RTP', which is equivalent to the RTP process up to n = 7.
From this point forward it proceeds according to the YH
process. Therefore, the initial probabilities of constructing
$X_n$-trees from NC and C under the RTP' process are different
from the YH process. We use the probabilities of
the starting trees NC and C under the RTP process as the
probabilities under the RTP'. Vinh et al. (2011) estimated
by simulations that the probability of the seven-taxon
non-caterpillar tree is 0.4607 under the RTP process, compared
with 0.4667 under the YH process, which gives us the following
inequality:

$$\mathbb{P}_{YH}(t_7 = NC) - \mathbb{P}_{RTP'}(t_7 = NC) > 0. \tag{14}$$

Theorem 3. If (13) holds then

$$\lim_{n \to \infty} d_{VAR}(\mathbb{P}_{RTP'}(t_n), \mathbb{P}_{YH}(t_n)) \neq 0.$$

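Proof. Let $\mathcal{S}(n)$ be the set of unlabeled $X_n$-trees and let:

$$\delta_n' := \sum_{t_n \in \mathcal{S}(n)} |\mathbb{P}_{YH}(t_n) - \mathbb{P}_{RTP'}(t_n)|. \tag{15}$$

Consider the event $E_n$ that $\pi_n \ge \gamma$. Then:

$$\mathbb{P}_{YH}(E_n) = \sum_{X \in \{NC, C\}} \mathbb{P}_{YH}(E_n \mid t_7 = X)\,\mathbb{P}_{YH}(t_7 = X), \tag{16}$$

$$\begin{aligned}
\mathbb{P}_{RTP'}(E_n) &= \sum_{X \in \{NC, C\}} \mathbb{P}_{RTP'}(E_n \mid t_7 = X)\,\mathbb{P}_{RTP'}(t_7 = X) \\
&= \sum_{X \in \{NC, C\}} \mathbb{P}_{YH}(E_n \mid t_7 = X)\,\mathbb{P}_{RTP'}(t_7 = X),
\end{aligned} \tag{17}$$

where the second equality in (17) holds because RTP' proceeds
according to the YH process beyond seven leaves. If we now
subtract Eqn. (17) from (16), substitute $\mathbb{P}_*(t_7 = C) = 1 - \mathbb{P}_*(t_7 = NC)$,
and let n tend to infinity, we have:

$$\lim_{n \to \infty} \left[ \mathbb{P}_{YH}(E_n) - \mathbb{P}_{RTP'}(E_n) \right] = \left( \mathbb{P}_{YH}(t_7 = NC) - \mathbb{P}_{RTP'}(t_7 = NC) \right) \left( \Pi^{NC} - \Pi^{C} \right). \tag{18}$$

Thus, if we apply inequalities (14) and (13) in Eqn. (18),
we obtain $\lim_{n \to \infty} [\mathbb{P}_{YH}(E_n) - \mathbb{P}_{RTP'}(E_n)] > 0$. Consequently,
$\delta_n'$ in (15) remains bounded away from zero, and so
$\lim_{n \to \infty} d_{VAR}(\mathbb{P}_{RTP'}(t_n), \mathbb{P}_{YH}(t_n)) \neq 0$, as claimed. □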

It is important to be clear about what we have established:
we have not formally shown that RTP does not converge
to YH, nor even that RTP' fails to converge to YH.
Rather, we have provided evidence that a certain property
of RTP' holds, and if so, this implies (Theorem 3) that RTP'
does not converge to YH. Then, since RTP' is a hybrid of
YH and RTP, this suggests that RTP does not either.

5. Further discussion and concluding comments 

In phylogenetic studies, trees are inferred from DNA 
sequences using various methods. It is also pertinent to ask 
what sort of trees these methods would produce, given en- 
tirely random data. This is one of the motivations of the 
study by Vinh et al. (2011). In the following discussion, 
we use an n by k matrix D to denote a sequence of k in- 
dependent characters on n taxa. Note that all the characters 
have the same state space S. The term 'random data' can 
refer to any one of the following three schemes: 

(R1). State x is assigned to taxon i in character j by an
independent, identically distributed (i.i.d.) process with
a probability $p_j(x)$, for $x \in S$.

When the probabilities of state x are the same for all characters
(i.e. if $p_j(x) = p(x)$ for all j), we obtain a stronger
notion as follows:

(R2). For every entry of the matrix D, $D_{ij}$ is assigned to
state x with probability p(x).

If all states are equally likely (i.e. if $p(x) = 1/|S|$), we arrive
at an even stronger notion as follows:

(R3). For all entries of D, all states have equal probabilities.

Vinh et al. (2011) suggest that random data imply that
quartet trees are equally likely and independent of each
other, stating:

In our setting, we assume no phylogenetic information
in the data. This is equivalent to the assumption
that each of the three topologies for a
quartet is equally likely and that the tree topology
for each quartet is independent of the other
quartets. . . . Hence, $3^{\binom{n}{4}}$ possible combinations
of quartet trees will serve as input to TP.

For any of the models (R1)-(R3), it certainly is true that
random sequence data provide equal support for all three
possible topologies of any four taxa. However, this does not
necessarily imply that the inferred quartet trees are exactly
independent. Rather than pursue this question here, we will
consider the behaviour of TP under a model in which quartet
trees are i.i.d. and uniform, as in Vinh et al. (2011).

While the RTP process appears to converge close to
the YH distribution, it is instructive to note that another tree
reconstruction method, maximum parsimony (MP), when
applied to random data, converges to a quite different
distribution on trees. Under model (R3) with two states, MP
converges to the PDA ('proportional to distinguishable
arrangements') model, which selects each unrooted binary
tree with equal probability. Let B(n) be the set of unrooted
binary trees on the leaf set $\{1,2,\ldots,n\}$. For model (R3) with
two states and k independent characters, we use $\mathcal{T}_{MP}(D)$ to
denote the MP tree on D (if the MP tree for D is not unique
then select one MP tree uniformly at random).

Theorem 4. Under random model (R3) with two states:

1. The random tree $\mathcal{T}_{MP}(D)$ has a PDA distribution on
B(n); i.e.

$$\mathbb{P}(\mathcal{T}_{MP}(D) = T) = \frac{1}{|B(n)|}.$$

2. For each fixed n, there is a unique MP tree for D with
probability converging to 1 as k grows.

Proof. 1. Let w(D,T), $T \in B(n)$, denote the parsimony
score of T on random data D. By Theorem 7.1 of Steel
(1993), the number of ways to colour the leaves of a
binary tree T with n leaves using two colours, so that
the resulting colouration has a given parsimony score
for T, depends only on n and not otherwise
on the tree T. Hence, for all $T \in B(n)$, the probability
$\mathbb{P}(w(D,T) = l) = f(l)$ is the same for all binary trees
with a given number of leaves. Therefore, each tree
has the same probability of being an MP tree for D.

2. Let $E_k(T,T')$ be the event that T and T' have exactly
the same parsimony score. By the Central Limit Theorem,
the probability that the difference in scores is
exactly zero (i.e. $\mathbb{P}(E_k(T,T'))$) tends to zero as k grows.

Let E be the event that the maximum parsimony tree
for D is unique, and let $E^c$ be the complement, namely
that there are at least two trees which have the same
minimal parsimony score for D. Note that $E^c$ is a subset of the
union of the events $E_k(T,T')$ over all distinct T, T'.
Therefore, we have:

$$1 - \mathbb{P}(E) \le \mathbb{P}\Big(\bigcup_{T \neq T'} E_k(T,T')\Big) \le \sum_{T \neq T'} \mathbb{P}(E_k(T,T')) \to 0,$$

as k grows. Thus, $\mathbb{P}(E) \to 1$ as $k \to \infty$, as required.

□

Hence the MP tree on random data with two states 
converges to the PDA model. 

In the PDA model, new leaf nodes are uniformly 
added onto any edges of the existing tree, whereas the Yule 
tree selects a pendant edge randomly, and adds a new node 
onto this pendant edge. During the construction process, 
PDA, RTP and RTP' can attach some new leaves onto interior
edges. For the PDA process, this has probability of
almost 1/2, and it is much less for RTP, as the number of
leaves increases. In the case of RTP', beyond seven leaves,
all further leaves are attached to a pendant edge, just as in
the YH model.
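The quantities behind the PDA statements can be computed exactly. A short Python sketch, using the standard facts that there are (2n - 5)!! unrooted binary trees on n labelled leaves and that such a tree has 2n - 3 edges, of which n - 3 are interior; the function names are illustrative.

```python
from fractions import Fraction


def num_unrooted_binary_trees(n):
    """|B(n)| = (2n - 5)!! for n >= 3 labelled leaves."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


def pda_interior_attachment_probability(n):
    """Chance that a uniformly chosen edge of an n-leaf unrooted binary
    tree is interior: (n - 3) of its (2n - 3) edges are interior."""
    return Fraction(n - 3, 2 * n - 3)


for n in (4, 7, 50, 1000):
    print(n, num_unrooted_binary_trees(n) if n < 10 else "...",
          float(pda_interior_attachment_probability(n)))
```

The interior-edge probability increases towards 1/2 from below as n grows, which is the sense in which it is "almost 1/2" for large trees.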

In conclusion, we have verified that, beyond some point,
the RTP process almost surely stops adding new leaves onto
interior edges, which makes the RTP process become more
like the YH process. However, the distance between the two
distributions appears to remain bounded away from zero
even when n tends to infinity, which suggests that they are
still two distinct tree construction methods.
Appendix: Technical details
Proof of Lemma 2 

Proof. At edge e, suppose that A and B partition X n , where 
n — 1 ^ k ^ 1, \A\ = k and \B\ = n — k. Let {a,b,c} be a 
subset of X n of size three. Suppose that a new leaf x is to be 
attached to e. Let q be a split of {x,a,b,c}, and q = xc\ab, 
xa\bc, xb\ac with equal probabilities. Suppose, a and b are 
always on one side of e, we consider the following four 
cases, 

'easel c e B and {a,b} C A; 

case II {a,b,c}<ZB; 

case III c e A and {a, b} C B; 

k case IV {fl,i,c}CA. 

We use Qi, Qu, gin an d 2iv to denote the set of 
quartet trees on leaf set {x,a,b,c} in the case I, II, III 
and IV respectively, and let Q be the entire set of quar- 
tet trees for the leaf set of {x,a,b,c}. Since the four cases 
are mutually exclusive, QjS partition Q, i e {I, II, III, IV}, 



{n—k\ 
3 




and the sizes of g,s are \Q l \ = ( 2 ) . 

|em| = (V)xffi^d|eiv| = ©. 

Let w(e) be a random variable of the weight that is 
added to e for a quartet tree of {x,a,b,c}. Consider w{e) 
for each case {I, II, III, IV}. Then we have: 



case I and III: w(e) 



• case II and IV: w(e) = 0. 

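Let $W_i(e)$, $i \in \{\mathrm{I}, \mathrm{II}, \mathrm{III}, \mathrm{IV}\}$,
be the sum of all the weights added to the edge $e$.
$W_{\mathrm{I}}(e)$ is a binomial random variable with parameters
$\binom{k}{2}\binom{n-k}{1}$ and $\frac{2}{3}$; $W_{\mathrm{III}}(e)$
is a binomial random variable with parameters
$\binom{n-k}{2}\binom{k}{1}$ and $\frac{2}{3}$;
$W_{\mathrm{II}}(e) = W_{\mathrm{IV}}(e) = 0$. Let $W_n(e)$ be the
sum of the $W_i(e)$ values, so we have
$W_n(e) = W_{\mathrm{I}}(e) + W_{\mathrm{III}}(e)$. Let
$n_1 = \binom{k}{2}\binom{n-k}{1}$ and
$n_2 = \binom{n-k}{2}\binom{k}{1}$; then

$$n_1 + n_2 = \frac{k(n-k)(n-2)}{2},$$

and so $W_n(e)$ consists of this many independent trials with
probability of success on each trial of $\frac{2}{3}$. That is,
$W_n(e)$ is a binomial random variable with parameters
$\frac{k(n-k)(n-2)}{2}$ and $\frac{2}{3}$.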



□ 
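The counting step in the proof of Lemma 2 can be checked by brute force for small trees. The sketch below is only an illustration: the function names are hypothetical, and it assumes the leaves are labelled 0, ..., n-1 with A taken to be the first k labels. It enumerates all triples, keeps those that fall under case I or case III, and compares the count with k(n-k)(n-2)/2.

```python
from fractions import Fraction
from itertools import combinations


def crossing_triples(n, k):
    """Count triples {a, b, c} that straddle the edge e (cases I and III).

    Assumption for illustration: A = {0, ..., k-1}, B = {k, ..., n-1}.
    """
    side_a = set(range(k))
    count = 0
    for triple in combinations(range(n), 3):
        in_a = sum(1 for leaf in triple if leaf in side_a)
        # in_a == 2 is case I, in_a == 1 is case III; 0 and 3 add no weight
        if in_a in (1, 2):
            count += 1
    return count


def trials_formula(n, k):
    """Number of binomial trials stated in Lemma 2: k(n-k)(n-2)/2."""
    return Fraction(k * (n - k) * (n - 2), 2)


def expected_weight(n, k):
    """Mean of a binomial with trials_formula(n, k) trials and p = 2/3."""
    return Fraction(2, 3) * trials_formula(n, k)


for n in range(4, 10):
    for k in range(1, n):
        assert crossing_triples(n, k) == trials_formula(n, k)
```

Exact rational arithmetic is used so that the comparison with the closed form is not affected by rounding.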



Proof of inequality (1) 

Let $E^P_n$ denote the set of pendant edges of the current
$X_n$-tree $T_n$, and $E^I_n$ be the set of interior edges.

Lemma 4. For any $e'' \in E^P_n$ and any $e' \in E^I_n$, the expected
pendant edge total weight $W_n(e'')$ and the expected interior
edge total weight $W_n(e')$ satisfy the inequality:

$$\mathbb{E}[W_n(e')] - \mathbb{E}[W_n(e'')] \ge \frac{1}{3}\left[n^2 - 5n + 6\right] > 0. \qquad (19)$$

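Proof. $W_n(e'')$ and $W_n(e')$ are binomial random variables with
the same probability of success $\frac{2}{3}$, but different numbers
of trials, $\frac{(n-1)(n-2)}{2}$ and $\frac{k(n-k)(n-2)}{2}$
respectively, where $k \in \{2, \ldots, n-2\}$. Thus

$$\mathbb{E}[W_n(e'')] = \frac{2}{3} \cdot \frac{(n-1)(n-2)}{2},$$

$$\mathbb{E}[W_n(e')] = \frac{2}{3} \cdot \frac{k(n-k)(n-2)}{2}.$$

For a fixed $n$, $\mathbb{E}[W_n(e')] - \mathbb{E}[W_n(e'')]$ is a
function of $k$. Therefore, to find the minimum of the difference
between these two expected values, we need to find the value(s) of
$k$ for which $\mathbb{E}[W_n(e')] - \mathbb{E}[W_n(e'')]$ is minimal.

Let $y = (n-2)(n-k)k - (n^2 - 3n + 2)$, so that
$\mathbb{E}[W_n(e')] - \mathbb{E}[W_n(e'')] = \frac{1}{3}y$; then
$\frac{dy}{dk} = (n-2)(n-2k)$. When $k = \frac{n}{2}$,
$\frac{dy}{dk} = 0$ and $\frac{d^2y}{dk^2} < 0$. Thus, there is a
maximum at $k = \frac{n}{2}$, and the minimum occurs at $k = 2$ or
$k = n-2$. Therefore, for every $k \in \{2, \ldots, n-2\}$, with
equality when $k = 2$ or $k = n-2$,

$$\frac{1}{3}\left[n^2 - 5n + 6\right] \le \mathbb{E}[W_n(e')] - \mathbb{E}[W_n(e'')].$$

Moreover, it is easily shown that for $n > 3$,
$\frac{1}{3}\left[n^2 - 5n + 6\right] > 0$. Therefore,

$$\mathbb{E}[W_n(e')] - \mathbb{E}[W_n(e'')] \ge \frac{1}{3}\left[n^2 - 5n + 6\right] > 0.$$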

□ 
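The minimisation in the proof of Lemma 4 can be confirmed numerically. The following is a minimal sketch with hypothetical function names; it takes the binomial means from Lemma 2 (trials k(n-k)(n-2)/2, success probability 2/3) as given, scans all interior values of k, and checks that the smallest gap equals (n^2 - 5n + 6)/3.

```python
from fractions import Fraction


def expected_total_weight(n, k):
    """Mean of W_n(e) for an edge splitting the leaves into k and n - k."""
    return Fraction(2, 3) * Fraction(k * (n - k) * (n - 2), 2)


def min_gap(n):
    """Smallest value of E[W_n(e')] - E[W_n(e'')] over k = 2, ..., n-2."""
    pendant = expected_total_weight(n, 1)  # a pendant edge has k = 1
    return min(expected_total_weight(n, k) - pendant
               for k in range(2, n - 1))


def lower_bound(n):
    """Right-hand side of inequality (19): (n^2 - 5n + 6) / 3."""
    return Fraction(n * n - 5 * n + 6, 3)


for n in range(4, 60):
    assert min_gap(n) == lower_bound(n) > 0
```

The scan only covers a finite range of n, so it illustrates the lemma rather than replacing the calculus argument.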

Theorem 5. For any $e'' \in E^P_n$ and any $e' \in E^I_n$,

$$\mathbb{P}\left(W_n(e'') \ge W_n(e')\right) \le 2\exp\left(-\frac{1}{576}n\right).$$

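Proof. Let $W_n'' = W_n(e'') - \mathbb{E}[W_n(e'')]$,
$W_n' = W_n(e') - \mathbb{E}[W_n(e')]$, and
$\beta = \mathbb{E}[W_n(e')] - \mathbb{E}[W_n(e'')]$.
By Lemma 4, for $n \ge 4$, $\beta \ge 2dn^2$, where $d = \frac{1}{48}$.

Now,

$$\begin{aligned}
\mathbb{P}\left(W_n(e'') \ge W_n(e')\right)
  &= \mathbb{P}\left(W_n'' - W_n' \ge \beta\right) \\
  &\le \mathbb{P}\left(W_n'' \ge \tfrac{\beta}{2}\right)
     + \mathbb{P}\left(-W_n' \ge \tfrac{\beta}{2}\right) \\
  &\le \mathbb{P}\left(W_n'' \ge dn^2\right)
     + \mathbb{P}\left(-W_n' \ge dn^2\right).
\end{aligned}$$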



We now apply Hoeffding's Inequality to the two terms on the right.
Suppose that $\{Y_i, i = 1, 2, 3, \ldots, N\}$ are independent
Bernoulli random variables, and let $Y = \sum_{i=1}^{N} Y_i$.
By Hoeffding's Inequality (Hoeffding 1963), we have:

$$\mathbb{P}\left(Y - \mathbb{E}(Y) \ge t\right) \le \exp\left(-2t^2/N\right),$$

$$\mathbb{P}\left(-(Y - \mathbb{E}(Y)) \ge t\right) \le \exp\left(-2t^2/N\right).$$

Taking $Y = W_n(e')$ (and $W_n(e'')$), $t = dn^2$, and
$N = \frac{k(n-k)(n-2)}{2}$ in the previous string of inequalities,
gives:

$$\mathbb{P}\left(W_n(e'') \ge W_n(e')\right) \le 2\exp\left(-\frac{1}{576}n\right).$$
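The text does not spell out how the constant 1/576 follows from Hoeffding's Inequality. One reading, offered here as an assumption rather than as the authors' derivation, is the crude bound N <= n^3/2, which gives 2t^2/N >= 4d^2 n = n/576 with d = 1/48. The sketch below (hypothetical function names) checks with exact arithmetic that beta >= 2dn^2 and that the Hoeffding exponent is at least n/576 over a range of n and k.

```python
from fractions import Fraction

d = Fraction(1, 48)  # the constant d used in the proof of Theorem 5


def beta_lower_bound(n):
    """Lower bound on beta from Lemma 4: (n^2 - 5n + 6) / 3."""
    return Fraction(n * n - 5 * n + 6, 3)


def hoeffding_exponent(n, k):
    """2 t^2 / N with t = d n^2 and N = k (n - k)(n - 2) / 2."""
    t = d * n * n
    trials = Fraction(k * (n - k) * (n - 2), 2)
    return 2 * t * t / trials


assert 4 * d * d == Fraction(1, 576)

for n in range(4, 80):
    # beta >= 2 d n^2 holds for n >= 4, with equality at n = 4
    assert beta_lower_bound(n) >= 2 * d * n * n
    for k in range(1, n):
        # k = 1 is a pendant edge, k >= 2 covers the interior edges
        assert hoeffding_exponent(n, k) >= Fraction(n, 576)
```

The check suggests that 1/576 is not tight, which is consistent with it coming from a deliberately loose bound on N.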



Proof of Inequality (2) 

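Proof. We will use Theorem 5 to establish Inequality (2). For
$e'' \in E^P_n$ and $e' \in E^I_n$, let $D$ be the event that

$$\min_{e'' \in E^P_n}\{W_n(e'')\} < \min_{e' \in E^I_n}\{W_n(e')\}.$$

Consider the complement of the event $D$,

$$D^c = \left\{\min_{e'' \in E^P_n}\{W_n(e'')\} \ge \min_{e' \in E^I_n}\{W_n(e')\}\right\},$$

that is, there is an interior edge $e'$ such that
$W_n(e') \le \min_{e'' \in E^P_n}\{W_n(e'')\}$, i.e.
$W_n(e') \le W_n(e'')$, $\forall e'' \in E^P_n$. Let $A_{e''e'}$ be
the event that $W_n(e'') \ge W_n(e')$, and write $P = E^P_n$ and
$I = E^I_n$; then we have

$$D^c \subseteq \bigcup_{(e'',e') \in P \times I} A_{e''e'},$$

and so

$$\mathbb{P}(D^c) \le \mathbb{P}\left(\bigcup_{(e'',e') \in P \times I} A_{e''e'}\right).$$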

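According to Boole's inequality,

$$\mathbb{P}\left(\bigcup_{(e'',e') \in P \times I} A_{e''e'}\right) \le \sum_{(e'',e') \in P \times I} \mathbb{P}\left(A_{e''e'}\right). \qquad (20)$$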



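Now, the number of pendant edges is $n$, i.e. $|P| = n$, and the
number of interior edges is $n-3$, i.e. $|I| = n-3$. Thus,
$|P \times I| = n(n-3)$, and so, by Theorem 5,
$\mathbb{P}(A_{e''e'}) = \mathbb{P}(W_n(e'') \ge W_n(e')) \le 2\exp\left(-\frac{1}{576}n\right)$.
Thus,

$$\sum_{(e'',e') \in P \times I} \mathbb{P}\left(A_{e''e'}\right) \le n(n-3)\,2\exp\left(-\frac{1}{576}n\right) < 2n^2\exp\left(-\frac{1}{576}n\right). \qquad (21)$$

Therefore,

$$\mathbb{P}\left(\min_{e'' \in E^P_n}\{W_n(e'')\} < \min_{e' \in E^I_n}\{W_n(e')\}\right) \ge 1 - 2n^2\exp\left(-\frac{1}{576}n\right).$$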

□ 
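To get a feel for when this bound becomes informative, it can be evaluated directly. The sketch below uses hypothetical function names and simply evaluates 1 - 2n^2 exp(-n/576). Since n^2 exp(-n/576) increases up to n = 1152 and decreases afterwards, the search for the first n with a positive bound starts there.

```python
import math


def lower_bound_prob(n):
    """Right-hand side of the final inequality: 1 - 2 n^2 exp(-n / 576)."""
    return 1.0 - 2.0 * n * n * math.exp(-n / 576.0)


def first_informative_n(start=1152, limit=10**6):
    """First n >= start at which the bound is positive.

    n^2 exp(-n/576) peaks at n = 1152, so the bound is negative for all
    smaller n >= 4 and increases monotonically from the peak onwards.
    """
    for n in range(start, limit):
        if lower_bound_prob(n) > 0:
            return n
    return None


threshold = first_informative_n()
print(threshold, lower_bound_prob(threshold))
```

Under this reading the bound is vacuous for moderate n and only becomes positive once n is on the order of ten thousand, which reflects the looseness of the constants rather than the behaviour of the RTP process itself.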



Proof of Inequality (3) 

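Proof. Since $\frac{k^2\exp(-ck)}{\exp(-ck/2)} = k^2\exp(-ck/2)$, and
$k^2\exp(-ck/2) \le 1$ for $c \ge \frac{4\log k}{k}$ and $k > 1$, we
have:

$$k^2\exp(-ck) \le \exp\left(-\frac{c}{2}k\right), \quad \text{where } c \ge \frac{4\log k}{k} \text{ and } k > 1.$$

Thus $\sum_{k=m}^{\infty} k^2\exp(-ck) \le \sum_{k=m}^{\infty} \exp\left(-\frac{c}{2}k\right)$,
where $c \ge \frac{4\log k}{k}$ and $k > 1$, and where
$\sum_{k=m}^{\infty} \exp\left(-\frac{c}{2}k\right)$ is the sum of a
geometric series,

$$\sum_{k=m}^{\infty} \exp\left(-\frac{c}{2}k\right) = \frac{\exp(-cm/2)}{1 - \exp(-c/2)}.$$

For $m \ge m_0$, $\exp(-cm/2) \le \exp(-cm_0/2)$. Therefore,

$$\sum_{k=m}^{\infty} k^2\exp(-ck) \le \frac{\exp(-cm_0/2)}{1 - \exp(-c/2)}, \quad \text{where } c \ge \frac{4\log k}{k} \text{ and } k > 1.$$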



□ 
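The tail bound can be illustrated numerically for one choice of constants. The values c = 1 and m = m_0 = 10 below are arbitrary illustrative choices, not values taken from the paper, and the function names are hypothetical. The infinite sum is approximated by a long finite sum, which is adequate here because the terms decay exponentially.

```python
import math


def tail_sum(c, m, terms=2000):
    """Finite approximation of sum_{k >= m} k^2 exp(-c k)."""
    return sum(k * k * math.exp(-c * k) for k in range(m, m + terms))


def geometric_bound(c, m0):
    """Bound from the proof: exp(-c m0 / 2) / (1 - exp(-c / 2))."""
    return math.exp(-c * m0 / 2) / (1 - math.exp(-c / 2))


def condition_holds(c, m):
    """Check c >= 4 log(k) / k for all k >= m.

    4 log(k) / k is decreasing for k >= 3, so for m >= 3 it is enough
    to check the condition at k = m.
    """
    return m >= 3 and c >= 4 * math.log(m) / m


c, m = 1.0, 10
assert condition_holds(c, m)
assert tail_sum(c, m) <= geometric_bound(c, m)
```

If the condition on c fails for the smallest index, as it does for c = 1 and m = 5, the termwise comparison with the geometric series is no longer guaranteed.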






